In this talk, I will report on the results of the SCALE workshop that was held at the JHU Human Language Technology Center of Excellence. This summer I worked for 8 weeks alongside 18 other researchers on improving the quality of Urdu-English machine translation.
Working with Urdu is different from working with Arabic and Chinese (the languages that the DARPA GALE program focuses on) because there is a very limited amount of bilingual training data available for Urdu (1 million words versus 200 million for the other languages), and because Urdu is a subject-object-verb (SOV) language. For these reasons, Urdu requires a different approach than the phrase-based models that have been employed in GALE.
I will describe how incorporating syntactic information into our statistical models of translation resulted in significantly improved translation quality. While previous research on incorporating syntax into translation has shown mixed results, we found striking improvements (over the summer our translation quality improved by 6 Bleu points, which is the single biggest improvement in translation quality that I have achieved in my career so far). I will also show our experiments with transliteration and with integrating semantic entities into translation.